This is a practice of RNA-seq pathway analysis using a small portion (6 out of 156) of GEO GSE135378 dataset. Differential expression analysis was done with DESeq2. Pathway analyses were performed for GO ontology and MSigDb gene sets.

Related paper: Pappalardi MB, et al. https://doi.org/10.1038/s43018-021-00249-x

The paper describes the discovery and systematic characterization of a novel small molecule DNA methyltransferase (DNMT) 1 inhibitor GSK3685032 and its anti-tumor activity in acute myeloid leukemia cell lines. Decitabine (DAC), a nucleoside analog and classic non-specific DNMT inhibitor, was used for comparison of the effects on DNA demethylation, transcription changes, DNA damage, and tumor inhibitory activity but not on specific gene expression.

In this exercise, we used NCBI-processed raw count. From NCBI:

Briefly, SRA runs where the organism is Homo sapiens and type is Transcriptomic are aligned to genome assembly GCA_000001405.15 using HISAT2. Runs that pass a 50% alignment rate are further processed with Subread featureCounts which outputs a raw count file for each run. For Human data, the Homo sapiens Annotation Release 109.20190905 was used for gene annotation. GEO further processes these SRR raw count files into GEO Series raw counts matrices. … In cases where there is more than one SRA run per GEO Sample, the raw counts are summed.

The dataset used here was from leukemia-derived macrophage cell line MV4-11 cells treated with 400 nM decitabine (DAC) or GSK3685032 (GSK in short hereafter) for 4 days together with vehicle (DMSO) control. In original data, each treatment has two set of replicates, each of which includes six samples (six SRRs). The NCBI raw count matrix has six count columns for these samples, two for each condition (DMSO, GSK, and DAC) with corresponding GSM numbers as column names.

The results were consistent with broad effects of DNA demethylation. However, this is a technical practice not aiming to reproduce the original analyses or draw any biological insights.

Load the libraries.

library(GEOquery)
library(AnnotationDbi)
library(org.Hs.eg.db)
library(DESeq2)
library(apeglm)
library(EDASeq)
library(limma)
library(clusterProfiler)
library(fgsea)
library(msigdbr)
library(ReactomePA)
library(vsn)
library(hexbin)
library(pheatmap)
library(dplyr)
library(stringr)
library(ggplot2)
library(RColorBrewer)
library(Glimma)
library(ggvenn)
library(plotly)

Data preparation

Dataset download

First, we download the NCBI raw count data and the parent GEO files for sample information.

Note:
1. GeneID (ENTREZID) are numbers;
2. Sample names are GSM*([0-9]) from GEO, not the actual sample names.

Also download MSigDB gene set set for pathway analysis. (Have to download manually from the website using an account).

# setwd("/N/slate/wenschen/pappa-paper")

# NCBI raw count data
url = "https://www.ncbi.nlm.nih.gov/geo/download/?type=rnaseq_counts&acc=GSE135378&format=file&file=GSE135378_raw_counts_GRCh38.p13_NCBI.tsv.gz"
download.file(url, destfile = "GSE135378_raw_counts_GRCh38.p13_NCBI.tsv.gz")
gunzip("GSE135378_raw_counts_GRCh38.p13_NCBI.tsv.gz")

# MSigDB gene sets (https://www.gsea-msigdb.org/gsea/msigdb)
# Curated set
# url = "https://www.gsea-msigdb.org/gsea/msigdb/download_file.jsp?filePath=/msigdb/release/2024.1.Hs/c2.all.v2024.1.Hs.entrez.gmt"

# # Hallmark set
# url = "https://www.gsea-msigdb.org/gsea/msigdb/download_file.jsp?filePath=/msigdb/release/2024.1.Hs/h.all.v2024.1.Hs.entrez.gmt"

# Reactome CP
#url = "https://www.gsea-msigdb.org/gsea/msigdb/download_file.jsp?filePath=/msigdb/release/2024.1.Hs/c2.cp.wikipathways.v2024.1.Hs.entrez.gmt"

raw = read.table("GSE135378_raw_counts_GRCh38.p13_NCBI.tsv", header = TRUE)
print(dim(raw))  # 39376 157
## [1] 39376   157
print(head(raw)[, 1:10])
##      GeneID GSM4007051 GSM4007052 GSM4007053 GSM4007054 GSM4007055 GSM4007056
## 1 100287102         32         26         24         24         43         38
## 2    653635       1240       1252       1183       1160       1500       1317
## 3 102466751         81         74         64         58         83         74
## 4 107985730         11          7          8          7         12         13
## 5 100302278          0          2          2          0          1          1
## 6    645520          1          0          0          0          0          0
##   GSM4007057 GSM4007058 GSM4007059
## 1         24         21         51
## 2       1051        997       1546
## 3         52         53         84
## 4          7          4          5
## 5          0          0          0
## 6          0          0          0

We need to download the parent file for metadata.

data_matrix = getGEO(GEO = "GSE135378", destdir = ".", GSEMatrix = TRUE, getGPL = FALSE)
## Found 1 file(s)
## GSE135378_series_matrix.txt.gz

Sample info

Prepare a sample info file for the count table, and a easy-to-read sample table for reference (can also be used for selecting data).

## file for count columns
sample_info = pData(phenoData(data_matrix[[1]]))[, c(1, 1)][1]
sample_info = data.frame(sample_info)
colnames(sample_info) = "details"
sample_info$details = sapply(sample_info, function(x) sub(" \\[.*", "", x))
write.table(sample_info, "sample_info.txt", 
            col.names = FALSE, quote = FALSE, sep = "\t")

## table
# Remove the "." in "3.2nM" for next step.
# Name the columns accordingly.
sample_info$details = sapply(sample_info$details, 
                             function(x) sub("3.2", "32", x, fixed = TRUE))
sample_info = cbind(rownames(sample_info), 
                    data.frame(str_split_fixed(sample_info$details, "\\.", 5)))
colnames(sample_info) = c("GSM", "cell_line", "dose", "drug", "day", "replicate")
# Get the "." back for "3.2nM".
sample_info$dose = as.numeric(gsub("nM", "", sample_info$dose))
sample_info$dose = sapply(sample_info$dose, function(x) ifelse(x == "32", "3.2", x))
sample_info$day = as.numeric(gsub("d", "", sample_info$day))
sample_info$replicate = as.numeric(gsub("n", "", sample_info$replicate))
# Rearrange the columns.
sample_info = sample_info[, c("GSM", "cell_line", "drug", "dose", "day", "replicate")]
write.table(sample_info, "sample_info_table.txt", 
            row.names = FALSE, quote = FALSE, sep = "\t")
print(head(sample_info))
##          GSM cell_line   drug dose day replicate
## 1 GSM4007051     MV411   DMSO    0   1         1
## 2 GSM4007052     MV411 GSK032  400   1         1
## 3 GSM4007053     MV411    DAC  400   1         1
## 4 GSM4007054     MV411   DMSO    0   2         1
## 5 GSM4007055     MV411 GSK032  400   2         1
## 6 GSM4007056     MV411    DAC  400   2         1

Prepare a small dataset

Only MV4-11, day4, DMSO, and 400 nM GSK or DAC treatment.

About the cell line (from ATCC):

MV-4-11 cells are macrophages that were isolated from the blast cells of a 10-year-old male with biphenotypic B-myelomonocytic leukemia and deposited by the Wistar Institute. Use these cells in cancer and immunology research.

sample_df = read.table("sample_info.txt")
colnames(sample_df) = c("GSM", "details")

## Check the corresponding names to make sure they match.
# The first column of data is GeneID.
# Total sample number is 156.
nrow(sample_df) == ncol(raw) - 1  # True
## [1] TRUE
all(sample_df$GSM == colnames(raw)[2:157]) # True 
## [1] TRUE
print(sum(sample_df$GSM == colnames(raw)[2:157]))  # 156
## [1] 156
## Replace and save.
colnames(raw)[2:157] = sample_df$details
# Take a look.
print(head(raw)[, 1:5])
##      GeneID MV411.0nM.DMSO.d1.n1 MV411.400nM.GSK032.d1.n1 MV411.400nM.DAC.d1.n1
## 1 100287102                   32                       26                    24
## 2    653635                 1240                     1252                  1183
## 3 102466751                   81                       74                    64
## 4 107985730                   11                        7                     8
## 5 100302278                    0                        2                     2
## 6    645520                    1                        0                     0
##   MV411.0nM.DMSO.d2.n1
## 1                   24
## 2                 1160
## 3                   58
## 4                    7
## 5                    0
## 6                    0
# Save the file.
write.table(raw, "GSE135378_raw_counts_GRCh38.p13_NCBI_renamed.tsv",
            row.names = FALSE, quote = FALSE, sep = "/t")

# Make a smaller table with the data of interest for next step.
# Day 4, MV4-11, for the Venn diagram in Fig. 6c from the paper.
cols_mv4d4 = grep("MV411", colnames(raw), value = TRUE) 
cols_mv4d4 = grep("d4", cols_mv4d4, value = TRUE)
mv4d4_raw = raw[, c("GeneID", cols_mv4d4)]
print(head(mv4d4_raw)[, 1:5])
##      GeneID MV411.0nM.DMSO.d4.n1 MV411.3.2nM.GSK032.d4.n1
## 1 100287102                   51                       49
## 2    653635                 1546                     1419
## 3 102466751                   84                       63
## 4 107985730                    5                        5
## 5 100302278                    0                        0
## 6    645520                    0                        1
##   MV411.16nM.GSK032.d4.n1 MV411.80nM.GSK032.d4.n1
## 1                      49                      48
## 2                    1474                    1553
## 3                      79                      99
## 4                      10                       7
## 5                       1                       1
## 6                       0                       0
# Let's make a smaller dataset: 400 nM and control only.
cols_mv4d4_400 = grepl("DMSO", colnames(mv4d4_raw)) | grepl("400nM", colnames(mv4d4_raw))
mv4d4_400_raw = cbind(GeneID = mv4d4_raw[, "GeneID"], mv4d4_raw[, c(cols_mv4d4_400)])
print(dim(mv4d4_400_raw))  # 39376 7
## [1] 39376     7
print(head(mv4d4_400_raw))
##      GeneID MV411.0nM.DMSO.d4.n1 MV411.400nM.GSK032.d4.n1 MV411.400nM.DAC.d4.n1
## 1 100287102                   51                       54                    78
## 2    653635                 1546                     1308                  1570
## 3 102466751                   84                       80                    74
## 4 107985730                    5                       16                    18
## 5 100302278                    0                        3                     0
## 6    645520                    0                        1                     1
##   MV411.0nM.DMSO.d4.n2 MV411.400nM.GSK032.d4.n2 MV411.400nM.DAC.d4.n2
## 1                   30                       46                    79
## 2                  925                      856                  2107
## 3                   46                       44                   120
## 4                    2                        8                    22
## 5                    0                        0                     1
## 6                    1                        1                     2
# Now we only have 6 samples, let's simplify the columns names.
print(colnames(mv4d4_400_raw))
## [1] "GeneID"                   "MV411.0nM.DMSO.d4.n1"    
## [3] "MV411.400nM.GSK032.d4.n1" "MV411.400nM.DAC.d4.n1"   
## [5] "MV411.0nM.DMSO.d4.n2"     "MV411.400nM.GSK032.d4.n2"
## [7] "MV411.400nM.DAC.d4.n2"
# [1] "GeneID" "MV411.0nM.DMSO.d4.n1"     "MV411.400nM.GSK032.d4.n1" "MV411.400nM.DAC.d4.n1"   
# [4] "MV411.0nM.DMSO.d4.n2"     "MV411.400nM.GSK032.d4.n2" "MV411.400nM.DAC.d4.n2" 
colnames(mv4d4_400_raw) = c("GeneID", "DMSO_1", "GSK_1", "DAC_1", "DMSO_2", "GSK_2", "DAC_2")
mv4d4_400_raw2 = mv4d4_400_raw  # Make a copy just in case
# Arrange the columns
mv4d4_400_raw = mv4d4_400_raw[, c("GeneID", "DMSO_1", "DMSO_2", "DAC_1",  "DAC_2", "GSK_1", "GSK_2")]
print(colnames(mv4d4_400_raw))
## [1] "GeneID" "DMSO_1" "DMSO_2" "DAC_1"  "DAC_2"  "GSK_1"  "GSK_2"
write.table(mv4d4_400_raw, "MV4-11_raw_counts_NCBI_d4_400nm.tsv",
            row.names = FALSE, quote = FALSE, sep = "\t")

Creating a sample table for DESeq

Sample names as index (row name);
Condition(treatment) as level for comparison.

sample_table = data.frame(condition = c("DMSO", "DMSO", "DAC", "DAC", "GSK", "GSK"))
rownames(sample_table) = colnames(mv4d4_400_raw[2:7])
sample_table$condition = as.factor(sample_table$condition)
sample_table$batch = as.factor(str_split_i(rownames(sample_table), "_", 2))
print(sample_table)
##        condition batch
## DMSO_1      DMSO     1
## DMSO_2      DMSO     2
## DAC_1        DAC     1
## DAC_2        DAC     2
## GSK_1        GSK     1
## GSK_2        GSK     2

create DESeq dataset object

First, construct a DESeq dataset from the count matrix, sample table, and set the comparison condition. Include batch in the design formula.

# Construct the count matrix. Same as above, but remove the gene ID column name.
count_matrix = mv4d4_400_raw[, 2:7]
rownames(count_matrix) = mv4d4_400_raw[, 1]
print(head(count_matrix))
##           DMSO_1 DMSO_2 DAC_1 DAC_2 GSK_1 GSK_2
## 100287102     51     30    78    79    54    46
## 653635      1546    925  1570  2107  1308   856
## 102466751     84     46    74   120    80    44
## 107985730      5      2    18    22    16     8
## 100302278      0      0     0     1     3     0
## 645520         0      1     1     2     1     1
# Make sure sample_table row names = column (sample) names of the matrix.
all(colnames(count_matrix) == rownames(sample_table))  # True
## [1] TRUE
# Now we can create the DESeq dataset object, adding batch into design.
dds = DESeqDataSetFromMatrix(countData = count_matrix,
                             colData = sample_table,
                             design = ~ batch + condition)
print(dds)
## class: DESeqDataSet 
## dim: 39376 6 
## metadata(1): version
## assays(1): counts
## rownames(39376): 100287102 653635 ... 4576 4571
## rowData names(0):
## colnames(6): DMSO_1 DMSO_2 ... GSK_1 GSK_2
## colData names(2): condition batch

Filtering read count

Filter the counts, removing low counts as recommended by the authors of DESeq.

# We have 3 conditions, each with two replicates. 
# Note that original authors set a much looser criterion (>= 2 reads) even with more samples.
keep = rowSums(counts(dds) >= 10) >= 2
sum(keep)  # 21700
## [1] 21700
# Total rows: 39376, so 39376 - 21700 = 17676 removed.
dds = dds[keep, ]

Differential expression analysis

DESeq task

# Tell DESeq the reference level (control).
# First check the condition.
print(dds$condition)
## [1] DMSO DMSO DAC  DAC  GSK  GSK 
## Levels: DAC DMSO GSK
print(dds$batch)
## [1] 1 2 1 2 1 2
## Levels: 1 2
# DMSO DMSO DAC  DAC  GSK  GSK 
# Levels: DAC DMSO GSK

# Set the condition.
dds$condition= factor(dds$condition, levels = c("DMSO", "DAC", "GSK"))
print(dds$condition)
## [1] DMSO DMSO DAC  DAC  GSK  GSK 
## Levels: DMSO DAC GSK
# Generate the log fold change table.
dds = DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
print(dds)
## class: DESeqDataSet 
## dim: 21700 6 
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(21700): 100287102 653635 ... 4576 4571
## rowData names(30): baseMean baseVar ... deviance maxCooks
## colnames(6): DMSO_1 DMSO_2 ... GSK_1 GSK_2
## colData names(3): condition batch sizeFactor

Get the results for GSK and DAC

Significance level set to 0.05 as that from original authors.

res_GSK = results(dds, contrast = c("condition", "GSK", "DMSO"), alpha = 0.05)
# Sort the rows by fold change.
res_GSK = res_GSK[order(res_GSK$log2FoldChange, decreasing = TRUE), ]
print(summary(res_GSK))
## 
## out of 21700 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 4685, 22%
## LFC < 0 (down)     : 4783, 22%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## 
## NULL
res_DAC = results(dds, contrast = c("condition", "DAC", "DMSO"), alpha = 0.05)
# Sort the rows by fold change.
res_DAC = res_DAC[order(res_DAC$log2FoldChange, decreasing = TRUE), ]
print(summary(res_DAC))
## 
## out of 21700 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 5065, 23%
## LFC < 0 (down)     : 5212, 24%
## outliers [1]       : 0, 0%
## low counts [2]     : 0, 0%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## 
## NULL

Log fold change shrinkage

For visualization and ranking according to DESeq authors.

lfcShrink_GSK = lfcShrink(dds, coef = "condition_GSK_vs_DMSO", type = "apeglm")
lfcShrink_DAC = lfcShrink(dds, coef = "condition_DAC_vs_DMSO", type = "apeglm")

Quality plots

We look at some plots for data quality.

MA-plot

Most of the points (genes) are expected to be around the horizontal 0 line. True.

DESeq2::plotMA(lfcShrink_GSK, ylim = c(-3, 3))

DESeq2::plotMA(lfcShrink_DAC, ylim = c(-3, 3))

#### P-value distribution

Show expected results.

par(mfrow = c(1, 2))

ggplot(res_GSK, aes(x = pvalue)) + geom_histogram(bins = 100) +
    ggtitle("Distribution of raw p values") +
    theme_classic()

ggplot(res_DAC, aes(x = padj)) + geom_histogram(bins = 100) +
        ggtitle("Distribution of raw p values") + 
        theme_classic()

Raw and normalized counts

Relative log expression plots. Normalization worked well.

EDASeq::plotRLE(counts(dds), 
        outline = FALSE, 
        col = colData(dds)$condition,
        ylim = c(-3, 3),
        main = "Raw Counts")

EDASeq::plotRLE(counts(dds, normalized = TRUE), 
        outline = FALSE, 
        col = colData(dds)$condition,
        ylim = c(-3, 3),
        main = "Normalized Counts")

Heatmaps of transformed count matrix

Replicates have variations with a pattern that seems to suggest batch effect, but overall they were clustered together.

Normalized counts.

select = order(rowMeans(counts(dds, normalized = TRUE)), decreasing = TRUE)[1:100]
col_df <- data.frame(colData(dds))["condition"]
pheatmap(assay(normTransform(dds))[select, ], scale = "row", 
         cluster_rows = FALSE, show_rownames = FALSE,
         annotation_col = col_df)

Counts with regularized log transformation(rlog).

# Regularized log transformation(rlog)
rlogT = rlog(dds)
pheatmap(assay(rlogT)[select, ], scale = "row", 
         cluster_rows = FALSE, show_rownames = FALSE,
         annotation_col = col_df)

Counts with variance_stabilizing transformation(vst).

# Regularized log transformation(rlog)
vsd = vst(dds)
pheatmap(assay(vsd)[select, ], scale = "row", 
         cluster_rows = FALSE, show_rownames = FALSE,
         annotation_col = col_df)

Use limma removeBatchBatchEffect to see if batch effect indeed exists.

Hard to tell here, but clearer look at the PCA plot below.

vsd2 = vsd
mat = assay(vsd2)
mmatrix = model.matrix(~ condition, colData(vsd2))
mat = removeBatchEffect(mat, batch = vsd2$batch, design = mmatrix)
assay(vsd2) = mat
pheatmap(assay(vsd2)[select, ], scale = "row", 
         cluster_rows = FALSE, show_rownames = FALSE,
         annotation_col = col_df)

Count PCA

Groups are well separated and replicates are close to each other. Also the variance was well captured by the first two components.

With rlog-transformation.

plotPCA(rlogT, ntop = 500, intgroup = "condition") + 
    theme_classic()

With vst.

plotPCA(vst(dds), ntop = 500, intgroup = "condition") + 
    theme_classic()

Use vst-transformed and limma-processed data fro PCA. Now the replicates are much closer for control and GSK.

plotPCA(vsd2, ntop = 500, intgroup = "condition") + 
    theme_classic()

sample-to-sample distance

Replicates are closer.

sampleDists = dist(t(assay(rlogT)))
sampleDistMatrix = as.matrix(sampleDists)
rownames(sampleDistMatrix) = paste(rlogT$condition, rlogT$batch, sep = "-")
colnames(sampleDistMatrix) = NULL
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists, 
         cols = colors)

Dispersion plot

Show typical result. Overall the data quality is good.

plotDispEsts(dds)

Annotation

Add gene symbol and name to the results.

annot_result = function(res) {
    res$Symbol = mapIds(org.Hs.eg.db,
                        keys = row.names(res),
                        column = "SYMBOL",
                        keytype = "ENTREZID",
                        multiVals = "first")
    
    res$GeneName = mapIds(org.Hs.eg.db,
                            keys = row.names(res),
                            column = "GENENAME",
                            keytype = "ENTREZID",
                            multiVals = "first")
    res = res[, c(7, 8, 1:6)]
    return(res)
}
res_GSK = annot_result(res_GSK)
res_DAC = annot_result(res_DAC)

About the matching of EntrezID and Symbol:

We would get quite a few NAs (~ 3%) in Symbol column:

print(nrow(res_GSK[is.na(res_DAC$Symbol), ]))
## [1] 723
print(nrow(res_DAC[is.na(res_DAC$Symbol), ]))
## [1] 723

This is because NCBI withdrew or replaced these EntrezIDs. So we can safely remove those entries.

res_GSK = res_GSK[!is.na(res_DAC$Symbol), ]
res_DAC = res_DAC[!is.na(res_DAC$Symbol), ]

print(dim(res_GSK))
## [1] 20977     8
print(dim(res_DAC))
## [1] 20977     8

Top 10 significantly (padj < 0.05) up-regulated genes.

print(res_GSK[res_GSK$padj < 0.05, ][1:10, c("Symbol", "GeneName", "log2FoldChange", "pvalue", "padj")])
## log2 fold change (MLE): condition GSK vs DMSO 
## Wald test p-value: condition GSK vs DMSO 
## DataFrame with 10 rows and 5 columns
##                 Symbol               GeneName log2FoldChange      pvalue
##            <character>            <character>      <numeric>   <numeric>
## 1769             DNAH8 dynein axonemal heav..        12.6256 3.00101e-18
## 168400           DDX53   DEAD-box helicase 53        11.7535 6.38616e-16
## 139135           PASD1 PAS domain containin..        10.9961 4.76582e-14
## 221400           TDRD6 tudor domain contain..        10.9548 9.18422e-17
## 4100            MAGEA1  MAGE family member A1        10.8775 9.69441e-14
## 107986400 LOC107986400 uncharacterized LOC1..        10.7869 1.71088e-13
## 140801          RPL10L ribosomal protein L1..        10.6207 4.03494e-13
## 503637          DUXAP8 double homeobox A ps..        10.6106 3.58835e-15
## 644623         TPTE2P2     TPTE2 pseudogene 2        10.6062 4.89920e-13
## 159163          RBMY1F RNA binding motif pr..        10.5356 6.75567e-13
##                  padj
##             <numeric>
## 1769      6.59129e-17
## 168400    1.16650e-14
## 139135    7.28811e-13
## 221400    1.80851e-15
## 4100      1.43793e-12
## 107986400 2.47178e-12
## 140801    5.62351e-12
## 503637    6.11683e-14
## 644623    6.77583e-12
## 159163    9.17385e-12
print(res_DAC[res_DAC$padj < 0.05, ][1:10, c("Symbol", "GeneName", "log2FoldChange", "pvalue", "padj")])
## log2 fold change (MLE): condition DAC vs DMSO 
## Wald test p-value: condition DAC vs DMSO 
## DataFrame with 10 rows and 5 columns
##             Symbol               GeneName log2FoldChange      pvalue
##        <character>            <character>      <numeric>   <numeric>
## 1769         DNAH8 dynein axonemal heav..       12.37099 1.38875e-17
## 168400       DDX53   DEAD-box helicase 53       11.08313 2.52659e-14
## 139135       PASD1 PAS domain containin..       10.56688 4.34106e-13
## 200772       UICLM up-regulated in colo..       10.15280 4.81644e-12
## 4100        MAGEA1  MAGE family member A1       10.10069 4.76705e-12
## 140801      RPL10L ribosomal protein L1..       10.02164 7.64920e-12
## 503637      DUXAP8 double homeobox A ps..        9.97498 1.39451e-13
## 4101        MAGEA2  MAGE family member A2        9.66811 4.62239e-11
## 266740     MAGEA2B MAGE family member A2B        9.64062 5.24897e-11
## 8991      SELENBP1 selenium binding pro..        9.56324 1.04546e-10
##               padj
##          <numeric>
## 1769   2.41668e-16
## 168400 3.24037e-13
## 139135 4.89356e-12
## 200772 4.93003e-11
## 4100   4.88178e-11
## 140801 7.69174e-11
## 503637 1.65451e-12
## 4101   4.30313e-10
## 266740 4.84438e-10
## 8991   9.28249e-10
print("Number of genes common to both drugs:")
## [1] "Number of genes common to both drugs:"
print(sum(res_GSK[1:10, ]$Symbol %in% res_DAC[1:10, ]$Symbol))
## [1] 6

Top 10 significantly (padj < 0.05) down-regulated genes.

print(tail(res_GSK[res_GSK$padj < 0.05, ], 10)[, c("Symbol", "GeneName", "log2FoldChange", "pvalue", "padj")])
## log2 fold change (MLE): condition GSK vs DMSO 
## Wald test p-value: condition GSK vs DMSO 
## DataFrame with 10 rows and 5 columns
##                 Symbol               GeneName log2FoldChange      pvalue
##            <character>            <character>      <numeric>   <numeric>
## 125704          DIPK1C divergent protein ki..       -4.13694 1.89200e-09
## 2563             GABRD gamma-aminobutyric a..       -4.15627 2.53000e-07
## 28832            IGLJ2 immunoglobulin lambd..       -4.15936 1.95013e-03
## 54768            HYDIN HYDIN axonemal centr..       -4.19980 1.69319e-02
## 28755             TRAC T cell receptor alph..       -4.23417 2.91930e-06
## 100874051    LZTS1-AS1  LZTS1 antisense RNA 1       -4.46116 1.16429e-02
## 105378210 LOC105378210 uncharacterized LOC1..       -4.60916 2.83952e-06
## 9834            FAM30A family with sequence..       -4.62309 1.08903e-02
## 8328             GFI1B growth factor indepe..       -5.11121 2.03254e-03
## 729224    KIAA2012-AS1 KIAA2012 antisense R..       -5.71228 3.58213e-03
##                  padj
##             <numeric>
## 125704    1.67100e-08
## 2563      1.61998e-06
## 28832     5.98894e-03
## 54768     4.01467e-02
## 28755     1.58412e-05
## 100874051 2.89869e-02
## 105378210 1.54237e-05
## 9834      2.74026e-02
## 8328      6.21650e-03
## 729224    1.03230e-02
print(tail(res_DAC[res_DAC$padj < 0.05, ], 10)[, c("Symbol", "GeneName", "log2FoldChange", "pvalue", "padj")])
## log2 fold change (MLE): condition DAC vs DMSO 
## Wald test p-value: condition DAC vs DMSO 
## DataFrame with 10 rows and 5 columns
##                 Symbol               GeneName log2FoldChange      pvalue
##            <character>            <character>      <numeric>   <numeric>
## 105371019 LOC105371019 uncharacterized LOC1..       -3.39580 1.90049e-02
## 6319               SCD stearoyl-CoA desatur..       -3.55122 1.44560e-22
## 167127          UGT3A2 UDP glycosyltransfer..       -3.55187 1.46825e-05
## 4843              NOS2 nitric oxide synthas..       -3.62293 1.69136e-13
## 107984343 LOC107984343 uncharacterized LOC1..       -3.64590 1.51609e-07
## 83729            INHBE inhibin subunit beta E       -3.81679 5.41816e-06
## 125704          DIPK1C divergent protein ki..       -3.89736 5.33969e-10
## 105378210 LOC105378210 uncharacterized LOC1..       -4.00699 1.72031e-06
## 81501          DCSTAMP dendrocyte expressed..       -4.01942 1.09540e-05
## 90271         OLMALINC oligodendrocyte matu..       -5.02066 5.50651e-33
##                  padj
##             <numeric>
## 105371019 4.12778e-02
## 6319      3.82090e-21
## 167127    6.34554e-05
## 4843      1.98821e-12
## 107984343 8.92543e-07
## 83729     2.50424e-05
## 125704    4.33325e-09
## 105378210 8.64538e-06
## 81501     4.84021e-05
## 90271     2.85182e-31
print("Number of genes common to both drugs:")
## [1] "Number of genes common to both drugs:"
print(sum(tail(res_GSK[res_GSK$padj < 0.05, ], 10)$Symbol %in% 
          tail(res_DAC[res_DAC$padj < 0.05,], 10)$Symbol))
## [1] 2

Save the files.

write.table(res_GSK, "MV411_day4_GSK032_400nM_NCBI_count_lfc.tsv",
            sep = "\t", quote = FALSE)
write.table(res_DAC, "MV411_day4_DAC_400nM_NCBI_count_lfc.tsv",
            sep = "\t", quote = FALSE)

Interactive Volcano plot

Hover the mouse over a point to see the gene symbol, fold change and -log10(padj).

plotVolcano = function(res_DE, condition) {
                      ggplot(res_DE,
                             aes(x = log2FoldChange,
                                 y = -log10(padj),
                                 label = Symbol,
                                 color = ifelse(res_DE$padj < 0.05 & abs(res_DE$log2FoldChange) > 1,
                                                    "yes", "not"))) +
                      geom_point() + 
                      labs(title = paste0(condition, " vs control"),
                          xlab = "log2 fold change", 
                          ylab = "-log10 adjusted p-value",
                          color = "> 2-fold & padj < 0.05") +
                      scale_color_manual(values = c("#56B4E9", "#D55E00")) +
                      theme(plot.title = element_text(size = 12, hjust = 0.5),
                            axis.title = element_text(size = 12),
                            legend.position = "right")
}                        
plotGSK = plotVolcano(res_GSK, "GSK")
ggplotly(plotGSK, tooltip = c("log2FoldChange", "-log10(padj)", "label"))
plotDAC = plotVolcano(res_DAC, "DAC")
ggplotly(plotDAC, tooltip = c("log2FoldChange", "-log10(padj)", "label"))

Venn diagrams for up- and down-regulated genes

There were more up-regulated genes than down-regulated genes.

par(mfrow = c(1, 2))

# Select genes with 2-fold changes and sort them decreasingly.
DE_gene_GSK = res_GSK[abs(res_GSK$log2FoldChange) >= 1 & res_GSK$padj < 0.05, ]
DE_gene_GSK = DE_gene_GSK[, c("Symbol", "log2FoldChange", "padj")]
DE_gene_GSK = DE_gene_GSK[!is.na(DE_gene_GSK$Symbol), ]
DE_gene_GSK = DE_gene_GSK[order(-DE_gene_GSK$log2FoldChange), ]
print(nrow(DE_gene_GSK))  # 3435 genes
## [1] 3144
DE_gene_DAC = res_DAC[abs(res_DAC$log2FoldChange) >= 1 & res_DAC$padj < 0.05, ]
DE_gene_DAC = DE_gene_DAC[, c("Symbol", "log2FoldChange", "padj")]
DE_gene_DAC = DE_gene_DAC[!is.na(DE_gene_DAC$Symbol), ]
DE_gene_DAC = DE_gene_DAC[order(-DE_gene_DAC$log2FoldChange), ]
print(nrow(DE_gene_DAC))  # 4015 genes
## [1] 3827
up_list = list(GSK = as.vector(DE_gene_GSK[DE_gene_GSK$log2FoldChange >= 1, ]$Symbol),
               DAC = as.vector(DE_gene_DAC[DE_gene_DAC$log2FoldChange >= 1, ]$Symbol))
print(sapply(names(up_list), function(x) length(up_list[[x]])))
##  GSK  DAC 
## 2477 3017
down_list = list(GSK = as.vector(DE_gene_GSK[DE_gene_GSK$log2FoldChange < 1, ]$Symbol),
                 DAC = as.vector(DE_gene_DAC[DE_gene_DAC$log2FoldChange < 1, ]$Symbol))
print(sapply(names(up_list), function(x) length(down_list[[x]])))
## GSK DAC 
## 667 810
ggvenn(up_list, 
       fill_color = c("white", "white"),
       stroke_size = 0.5,
       set_name_size = 6) + 
       ggtitle("Up-regulated genes") +
       theme(plot.title = element_text(size = 16, hjust = 0.5))

ggvenn(down_list, 
       fill_color = c("white", "white"),
       stroke_size = 0.5,
       set_name_size = 6) + 
       ggtitle("Down-regulated genes") +
       theme(plot.title = element_text(size = 16, hjust = 0.5))

Gene enrichment analysis

GO ontology

Use clusterProfiler package. This package runs fgesa algorithm by default.

The following results appeared to indicate changes involving molecular interactions and cellular structures. The similarity in “Biological process” invoked by the two drugs was prominent.

GSK

goPath = function(ont, gene) {
                enrichGO(gene = gene,
                        OrgDb = org.Hs.eg.db,
                        keyType = "SYMBOL",
                        ont = ont,
                        pAdjustMethod = "BH",
                        pvalueCutoff = 0.01,
                        qvalueCutoff = 0.05)
}

# Molecular functions
goGSK_MF = goPath("MF", DE_gene_GSK$Symbol)
print(head(goGSK_MF$Description, 10))
##  [1] "structural constituent of chromatin"        
##  [2] "protein heterodimerization activity"        
##  [3] "MHC class II protein complex binding"       
##  [4] "extracellular matrix structural constituent"
##  [5] "cytoskeletal motor activity"                
##  [6] "cargo receptor activity"                    
##  [7] "histone kinase activity"                    
##  [8] "MHC protein complex binding"                
##  [9] "single-stranded DNA binding"                
## [10] "single-stranded DNA helicase activity"
# Biological process
goGSK_BP = goPath("BP", DE_gene_GSK$Symbol)
print(head(goGSK_BP$Description, 10))
##  [1] "nucleosome assembly"                                
##  [2] "nucleosome organization"                            
##  [3] "nuclear division"                                   
##  [4] "meiotic cell cycle process"                         
##  [5] "protein-DNA complex assembly"                       
##  [6] "meiotic nuclear division"                           
##  [7] "organelle fission"                                  
##  [8] "meiotic cell cycle"                                 
##  [9] "protein localization to CENP-A containing chromatin"
## [10] "meiosis I cell cycle process"

DAC

# Molecular functions
goDAC_MF = goPath("MF", DE_gene_DAC$Symbol)
print(head(goDAC_MF$Description, 10))
##  [1] "structural constituent of chromatin" "protein heterodimerization activity"
##  [3] "sulfur compound binding"             "cytoskeletal motor activity"        
##  [5] "integrin binding"                    "heparin binding"                    
##  [7] "glycosaminoglycan binding"           "amyloid-beta binding"               
##  [9] "peptide antigen binding"             "gated channel activity"
# Biological process
goDAC_BP = goPath("BP", DE_gene_DAC$Symbol)
print(head(goDAC_BP$Description, 10))
##  [1] "nucleosome assembly"                                
##  [2] "nuclear division"                                   
##  [3] "nucleosome organization"                            
##  [4] "organelle fission"                                  
##  [5] "meiotic cell cycle process"                         
##  [6] "protein-DNA complex assembly"                       
##  [7] "meiotic nuclear division"                           
##  [8] "protein localization to CENP-A containing chromatin"
##  [9] "meiotic cell cycle"                                 
## [10] "regulation of nuclear division"

plot the “Biological Process” network. (Hard to customize the figure, so as is)

goplot(goGSK_BP, showCategory = 5)

goplot(goDAC_BP, showCategory = 5)

MSigDb gene set enrichment analysis

gesaPath = function(gene_set, DE_genes, nPermSimple = 10000) {
                pathways = gmtPathways(gene_set)
                ranks = DE_genes$log2FoldChange
                names(ranks) = rownames(DE_genes)
                gesa_result = fgsea(pathways = pathways, 
                                    stats = ranks,
                                    minSize = 15,
                                    maxSize = 500,
                                    nPermSimple = nPermSimple)
                return(gesa_result)
}

GSK

Curated (C2) pathways.

Result includes Many cancer- and methylation-related pathways.

c2_gene_set = "gmt-file/c2.all.v2024.1.Hs.entrez.gmt"  # Path to gmt file
c2_GSK = gesaPath(c2_gene_set, DE_gene_GSK)

c2_top10_up = c2_GSK[ES > 0][head(order(pval), 10), pathway]
cat("Top ten up-regulated pathways:\n")
## Top ten up-regulated pathways:
print(c2_top10_up)
##  [1] "YEGNASUBRAMANIAN_PROSTATE_CANCER"                                        
##  [2] "ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN"                                   
##  [3] "ZHONG_RESPONSE_TO_AZACITIDINE_AND_TSA_UP"                                
##  [4] "MIKKELSEN_MEF_HCP_WITH_H3_UNMETHYLATED"                                  
##  [5] "MUELLER_METHYLATED_IN_GLIOBLASTOMA"                                      
##  [6] "REACTOME_NEUROTRANSMITTER_RECEPTORS_AND_POSTSYNAPTIC_SIGNAL_TRANSMISSION"
##  [7] "RICKMAN_HEAD_AND_NECK_CANCER_B"                                          
##  [8] "HAMAI_APOPTOSIS_VIA_TRAIL_DN"                                            
##  [9] "SENGUPTA_NASOPHARYNGEAL_CARCINOMA_WITH_LMP1_DN"                          
## [10] "MCGARVEY_SILENCED_BY_METHYLATION_IN_COLON_CANCER"
c2_top10_down = c2_GSK[ES < 0][head(order(pval), 10), pathway]
cat("Top ten down-regulated pathways:\n")
## Top ten down-regulated pathways:
print(c2_top10_down)
##  [1] "PUJANA_XPRSS_INT_NETWORK"         "REACTOME_S_PHASE"                
##  [3] "CROONQUIST_NRAS_SIGNALING_DN"     "KAUFFMANN_DNA_REPAIR_GENES"      
##  [5] "WP_RETINOBLASTOMA_GENE_IN_CANCER" "MORI_LARGE_PRE_BII_LYMPHOCYTE_UP"
##  [7] "ISHIDA_E2F_TARGETS"               "KANG_DOXORUBICIN_RESISTANCE_UP"  
##  [9] "PUJANA_BRCA_CENTERED_NETWORK"     "REACTOME_SYNTHESIS_OF_DNA"
rank_GSK = DE_gene_GSK$log2FoldChange
names(rank_GSK) = rownames(DE_gene_GSK)
c2_pathways = gmtPathways(c2_gene_set)

plotEnrichment(c2_pathways[["ZHONG_RESPONSE_TO_AZACITIDINE_AND_TSA_UP"]], rank_GSK) +
            labs(title = "RESPONSE_TO_AZACITIDINE_AND_TSA_UP genes")

Hallmark pathways.

Metabolic alteration, DNA repair and immune responses.

h_gene_set = "gmt-file/h.all.v2024.1.Hs.entrez.gmt"
h_GSK = gesaPath(h_gene_set, DE_gene_GSK)

h_top10_up = h_GSK[ES > 0][head(order(pval), 10), pathway]
cat("Top ten up-regulated pathways:\n")
## Top ten up-regulated pathways:
print(h_top10_up)
##  [1] "HALLMARK_KRAS_SIGNALING_DN"         "HALLMARK_SPERMATOGENESIS"          
##  [3] "HALLMARK_MYOGENESIS"                "HALLMARK_FATTY_ACID_METABOLISM"    
##  [5] "HALLMARK_HYPOXIA"                   "HALLMARK_KRAS_SIGNALING_UP"        
##  [7] "HALLMARK_ADIPOGENESIS"              "HALLMARK_INTERFERON_ALPHA_RESPONSE"
##  [9] "HALLMARK_XENOBIOTIC_METABOLISM"     "HALLMARK_GLYCOLYSIS"
h_top10_down = h_GSK[ES < 0][head(order(pval), 10), pathway]
cat("Top ten down-regulated pathways:\n")
## Top ten down-regulated pathways:
print(h_top10_down)
##  [1] "HALLMARK_MTORC1_SIGNALING"          "HALLMARK_MYC_TARGETS_V1"           
##  [3] "HALLMARK_MITOTIC_SPINDLE"           "HALLMARK_IL2_STAT5_SIGNALING"      
##  [5] "HALLMARK_CHOLESTEROL_HOMEOSTASIS"   "HALLMARK_P53_PATHWAY"              
##  [7] "HALLMARK_INTERFERON_GAMMA_RESPONSE" "HALLMARK_COAGULATION"              
##  [9] "HALLMARK_DNA_REPAIR"                "HALLMARK_APOPTOSIS"

Reactome pathways.

reactome_set = "gmt-file/c2.cp.reactome.v2024.1.Hs.entrez.gmt"
react_GSK = gesaPath(reactome_set, DE_gene_GSK)

react_top10_up = react_GSK[ES > 0][head(order(pval), 10), pathway]
cat("Top ten up-regulated pathways:\n")
## Top ten up-regulated pathways:
print(react_top10_up)
##  [1] "REACTOME_NEUROTRANSMITTER_RECEPTORS_AND_POSTSYNAPTIC_SIGNAL_TRANSMISSION"   
##  [2] "REACTOME_ACTIVATION_OF_NMDA_RECEPTORS_AND_POSTSYNAPTIC_EVENTS"              
##  [3] "REACTOME_TRANSMISSION_ACROSS_CHEMICAL_SYNAPSES"                             
##  [4] "REACTOME_POST_TRANSLATIONAL_MODIFICATION_SYNTHESIS_OF_GPI_ANCHORED_PROTEINS"
##  [5] "REACTOME_POTASSIUM_CHANNELS"                                                
##  [6] "REACTOME_SENSORY_PERCEPTION"                                                
##  [7] "REACTOME_METABOLISM_OF_AMINO_ACIDS_AND_DERIVATIVES"                         
##  [8] "REACTOME_G_ALPHA_S_SIGNALLING_EVENTS"                                       
##  [9] "REACTOME_ION_CHANNEL_TRANSPORT"                                             
## [10] "REACTOME_NEURONAL_SYSTEM"
react_top10_down = react_GSK[ES < 0][head(order(pval), 10), pathway]
cat("Top ten down-regulated pathways:\n")
## Top ten down-regulated pathways:
print(react_top10_down)
##  [1] "REACTOME_S_PHASE"                                                        
##  [2] "REACTOME_SYNTHESIS_OF_DNA"                                               
##  [3] "REACTOME_MITOTIC_G1_PHASE_AND_G1_S_TRANSITION"                           
##  [4] "REACTOME_RMTS_METHYLATE_HISTONE_ARGININES"                               
##  [5] "REACTOME_G2_M_DNA_DAMAGE_CHECKPOINT"                                     
##  [6] "REACTOME_PROCESSING_OF_DNA_DOUBLE_STRAND_BREAK_ENDS"                     
##  [7] "REACTOME_HDMS_DEMETHYLATE_HISTONES"                                      
##  [8] "REACTOME_BASE_EXCISION_REPAIR_AP_SITE_FORMATION"                         
##  [9] "REACTOME_REPLACEMENT_OF_PROTAMINES_BY_NUCLEOSOMES_IN_THE_MALE_PRONUCLEUS"
## [10] "REACTOME_INHIBITION_OF_DNA_RECOMBINATION_AT_TELOMERE"
react_pathways = gmtPathways(reactome_set)

plotEnrichment(react_pathways[["REACTOME_DNA_METHYLATION"]], rank_GSK) +
            labs(title = "REACTOME_DNA_METHYLATION genes")

DAC

Curated (C2) pathways.

Similar responses to GSK.

c2_gene_set = "gmt-file/c2.all.v2024.1.Hs.entrez.gmt"  # Path to gmt file
c2_DAC = gesaPath(c2_gene_set, DE_gene_DAC)

c2_top10_up = c2_DAC[ES > 0][head(order(pval), 10), pathway]
cat("Top ten up-regulated pathways:\n")
## Top ten up-regulated pathways:
print(c2_top10_up)
##  [1] "YEGNASUBRAMANIAN_PROSTATE_CANCER"                
##  [2] "ACEVEDO_METHYLATED_IN_LIVER_CANCER_DN"           
##  [3] "RICKMAN_HEAD_AND_NECK_CANCER_B"                  
##  [4] "LIANG_SILENCED_BY_METHYLATION_2"                 
##  [5] "MIKKELSEN_MEF_HCP_WITH_H3_UNMETHYLATED"          
##  [6] "ZHONG_RESPONSE_TO_AZACITIDINE_AND_TSA_UP"        
##  [7] "HELLER_SILENCED_BY_METHYLATION_UP"               
##  [8] "MCGARVEY_SILENCED_BY_METHYLATION_IN_COLON_CANCER"
##  [9] "MUELLER_METHYLATED_IN_GLIOBLASTOMA"              
## [10] "TAVAZOIE_METASTASIS"
c2_top10_down = c2_DAC[ES < 0][head(order(pval), 10), pathway]
cat("Top ten down-regulated pathways:\n")
## Top ten down-regulated pathways:
print(c2_top10_down)
##  [1] "PUJANA_BRCA_CENTERED_NETWORK"                                             
##  [2] "ZHOU_CELL_CYCLE_GENES_IN_IR_RESPONSE_6HR"                                 
##  [3] "CROONQUIST_NRAS_SIGNALING_DN"                                             
##  [4] "MORI_IMMATURE_B_LYMPHOCYTE_DN"                                            
##  [5] "REN_BOUND_BY_E2F"                                                         
##  [6] "WP_RETINOBLASTOMA_GENE_IN_CANCER"                                         
##  [7] "REACTOME_S_PHASE"                                                         
##  [8] "KONG_E2F3_TARGETS"                                                        
##  [9] "REACTOME_DEPOSITION_OF_NEW_CENPA_CONTAINING_NUCLEOSOMES_AT_THE_CENTROMERE"
## [10] "REACTOME_ASSEMBLY_OF_THE_ORC_COMPLEX_AT_THE_ORIGIN_OF_REPLICATION"
rank_DAC = DE_gene_DAC$log2FoldChange
names(rank_DAC) = rownames(DE_gene_DAC)

plotEnrichment(c2_pathways[["ZHONG_RESPONSE_TO_AZACITIDINE_AND_TSA_UP"]], rank_DAC) +
            labs(title = "RESPONSE_TO_AZACITIDINE_AND_TSA_UP genes")

Hallmark pathways.

h_gene_set = "gmt-file/h.all.v2024.1.Hs.entrez.gmt"
h_DAC = gesaPath(h_gene_set, DE_gene_DAC)

h_top10_up = h_DAC[ES > 0][head(order(pval), 10), pathway]
cat("Top ten up-regulated pathways:\n")
## Top ten up-regulated pathways:
print(h_top10_up)
##  [1] "HALLMARK_MYOGENESIS"                       
##  [2] "HALLMARK_KRAS_SIGNALING_UP"                
##  [3] "HALLMARK_KRAS_SIGNALING_DN"                
##  [4] "HALLMARK_HYPOXIA"                          
##  [5] "HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION"
##  [6] "HALLMARK_XENOBIOTIC_METABOLISM"            
##  [7] "HALLMARK_INFLAMMATORY_RESPONSE"            
##  [8] "HALLMARK_HEME_METABOLISM"                  
##  [9] "HALLMARK_COAGULATION"                      
## [10] "HALLMARK_COMPLEMENT"
h_top10_down = h_DAC[ES < 0][head(order(pval), 10), pathway]
cat("Top ten down-regulated pathways:\n")
## Top ten down-regulated pathways:
print(h_top10_down)
##  [1] "HALLMARK_MYC_TARGETS_V1"          "HALLMARK_CHOLESTEROL_HOMEOSTASIS"
##  [3] "HALLMARK_MYC_TARGETS_V2"          "HALLMARK_MITOTIC_SPINDLE"        
##  [5] "HALLMARK_DNA_REPAIR"              "HALLMARK_ANDROGEN_RESPONSE"      
##  [7] "HALLMARK_FATTY_ACID_METABOLISM"   "HALLMARK_P53_PATHWAY"            
##  [9] "HALLMARK_TNFA_SIGNALING_VIA_NFKB" "HALLMARK_APOPTOSIS"

Reactome pathways.

reactome_set = "gmt-file/c2.cp.reactome.v2024.1.Hs.entrez.gmt"
react_DAC = gesaPath(reactome_set, DE_gene_DAC)

react_top10_up = react_DAC[ES > 0][head(order(pval), 10), pathway]
cat("Top ten up-regulated pathways:\n")
## Top ten up-regulated pathways:
print(react_top10_up)
##  [1] "REACTOME_NEUROTRANSMITTER_RECEPTORS_AND_POSTSYNAPTIC_SIGNAL_TRANSMISSION"
##  [2] "REACTOME_STIMULI_SENSING_CHANNELS"                                       
##  [3] "REACTOME_ACTIVATION_OF_NMDA_RECEPTORS_AND_POSTSYNAPTIC_EVENTS"           
##  [4] "REACTOME_ION_CHANNEL_TRANSPORT"                                          
##  [5] "REACTOME_TRANSMISSION_ACROSS_CHEMICAL_SYNAPSES"                          
##  [6] "REACTOME_DISEASES_ASSOCIATED_WITH_O_GLYCOSYLATION_OF_PROTEINS"           
##  [7] "REACTOME_PHASE_I_FUNCTIONALIZATION_OF_COMPOUNDS"                         
##  [8] "REACTOME_O_GLYCOSYLATION_OF_TSR_DOMAIN_CONTAINING_PROTEINS"              
##  [9] "REACTOME_CARDIAC_CONDUCTION"                                             
## [10] "REACTOME_PI3K_AKT_SIGNALING_IN_CANCER"
react_top10_down = react_DAC[ES < 0][head(order(pval), 10), pathway]
cat("Top ten down-regulated pathways:\n")
## Top ten down-regulated pathways:
print(react_top10_down)
##  [1] "REACTOME_DEPOSITION_OF_NEW_CENPA_CONTAINING_NUCLEOSOMES_AT_THE_CENTROMERE"     
##  [2] "REACTOME_S_PHASE"                                                              
##  [3] "REACTOME_ASSEMBLY_OF_THE_ORC_COMPLEX_AT_THE_ORIGIN_OF_REPLICATION"             
##  [4] "REACTOME_CHROMATIN_MODIFICATIONS_DURING_THE_MATERNAL_TO_ZYGOTIC_TRANSITION_MZT"
##  [5] "REACTOME_MATERNAL_TO_ZYGOTIC_TRANSITION_MZT"                                   
##  [6] "REACTOME_DNA_METHYLATION"                                                      
##  [7] "REACTOME_SENESCENCE_ASSOCIATED_SECRETORY_PHENOTYPE_SASP"                       
##  [8] "REACTOME_SIRT1_NEGATIVELY_REGULATES_RRNA_EXPRESSION"                           
##  [9] "REACTOME_NEGATIVE_EPIGENETIC_REGULATION_OF_RRNA_EXPRESSION"                    
## [10] "REACTOME_ERCC6_CSB_AND_EHMT2_G9A_POSITIVELY_REGULATE_RRNA_EXPRESSION"
plotEnrichment(react_pathways[["REACTOME_DNA_METHYLATION"]], rank_DAC) +
            labs(title = "REACTOME_DNA_METHYLATION genes")

Interactive figure for exploration

Interactive figures with a dynamic data table.

See Glimma vignette for instruction.

glimmaMA(dds, groups = dds$condition)
## 0 of 21700 genes were filtered out in DESeq2 tests

Conclusion

We demonstrated the process of obtaining NCBI RNA-seq raw count, selecting dataset and using it for differential expression and pathway analysis. The results were consistent with DNA demethylation effects as expected.

Resources

Love MI, et al. Analyzing RNA-seq data with DESeq2. https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

Akalin, A. Computational Genomics with R, Chapter 8, RNA-seq analysis. https://compgenomr.github.io/book/rnaseqanalysis.html

Gu Y, et al. Biomedical Knowledge Mining using GOSemSim and clusterProfiler. Part II: Enrichment analysis. https://yulab-smu.top/biomedical-knowledge-mining-book/enrichment-overview.html

Session info

sessionInfo()
## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] plotly_4.10.4               ggvenn_0.1.10              
##  [3] Glimma_2.12.0               RColorBrewer_1.1-3         
##  [5] ggplot2_3.5.1               stringr_1.5.1              
##  [7] dplyr_1.1.4                 pheatmap_1.0.12            
##  [9] hexbin_1.28.5               vsn_3.70.0                 
## [11] ReactomePA_1.46.0           msigdbr_7.5.1              
## [13] fgsea_1.28.0                clusterProfiler_4.10.1     
## [15] limma_3.58.1                EDASeq_2.36.0              
## [17] ShortRead_1.60.0            GenomicAlignments_1.38.2   
## [19] Rsamtools_2.18.0            Biostrings_2.70.3          
## [21] XVector_0.42.0              BiocParallel_1.36.0        
## [23] apeglm_1.24.0               DESeq2_1.42.1              
## [25] SummarizedExperiment_1.32.0 MatrixGenerics_1.14.0      
## [27] matrixStats_1.4.1           GenomicRanges_1.54.1       
## [29] GenomeInfoDb_1.38.8         org.Hs.eg.db_3.18.0        
## [31] AnnotationDbi_1.64.1        IRanges_2.36.0             
## [33] S4Vectors_0.40.2            GEOquery_2.70.0            
## [35] Biobase_2.62.0              BiocGenerics_0.48.1        
## 
## loaded via a namespace (and not attached):
##   [1] splines_4.3.2           BiocIO_1.12.0           bitops_1.0-9           
##   [4] ggplotify_0.1.2         filelock_1.0.3          tibble_3.2.1           
##   [7] R.oo_1.27.0             polyclip_1.10-7         preprocessCore_1.64.0  
##  [10] graph_1.80.0            XML_3.99-0.17           lifecycle_1.0.4        
##  [13] edgeR_4.0.16            lattice_0.22-6          MASS_7.3-60            
##  [16] crosstalk_1.2.1         magrittr_2.0.3          sass_0.4.9             
##  [19] rmarkdown_2.29          jquerylib_0.1.4         yaml_2.3.10            
##  [22] cowplot_1.1.3           DBI_1.2.3               abind_1.4-8            
##  [25] zlibbioc_1.48.2         purrr_1.0.2             R.utils_2.12.3         
##  [28] ggraph_2.2.1            RCurl_1.98-1.16         yulab.utils_0.1.8      
##  [31] tweenr_2.0.3            rappdirs_0.3.3          GenomeInfoDbData_1.2.11
##  [34] enrichplot_1.22.0       ggrepel_0.9.6           tidytree_0.4.6         
##  [37] reactome.db_1.86.2      codetools_0.2-20        DelayedArray_0.28.0    
##  [40] DOSE_3.28.2             xml2_1.3.6              ggforce_0.4.2          
##  [43] tidyselect_1.2.1        aplot_0.2.4             farver_2.1.2           
##  [46] viridis_0.6.5           BiocFileCache_2.10.2    jsonlite_1.8.9         
##  [49] tidygraph_1.3.1         bbmle_1.0.25.1          tools_4.3.2            
##  [52] progress_1.2.3          treeio_1.26.0           snow_0.4-4             
##  [55] Rcpp_1.0.13-1           glue_1.7.0              gridExtra_2.3          
##  [58] SparseArray_1.2.4       xfun_0.49               qvalue_2.34.0          
##  [61] withr_3.0.2             numDeriv_2016.8-1.1     BiocManager_1.30.25    
##  [64] fastmap_1.2.0           latticeExtra_0.6-30     digest_0.6.37          
##  [67] gridGraphics_0.5-1      R6_2.5.1                colorspace_2.1-0       
##  [70] GO.db_3.18.0            jpeg_0.1-10             biomaRt_2.58.2         
##  [73] RSQLite_2.3.9           R.methodsS3_1.8.2       tidyr_1.3.1            
##  [76] generics_0.1.3          data.table_1.16.4       rtracklayer_1.62.0     
##  [79] htmlwidgets_1.6.4       prettyunits_1.2.0       graphlayouts_1.2.1     
##  [82] httr_1.4.7              S4Arrays_1.2.1          scatterpie_0.2.4       
##  [85] graphite_1.48.0         pkgconfig_2.0.3         gtable_0.3.6           
##  [88] blob_1.2.4              hwriter_1.3.2.1         shadowtext_0.1.4       
##  [91] htmltools_0.5.8.1       scales_1.3.0            png_0.1-8              
##  [94] ggfun_0.1.8             knitr_1.49              rstudioapi_0.17.1      
##  [97] tzdb_0.4.0              reshape2_1.4.4          rjson_0.2.23           
## [100] nlme_3.1-166            coda_0.19-4.1           curl_6.0.1             
## [103] bdsmatrix_1.3-7         cachem_1.1.0            parallel_4.3.2         
## [106] HDO.db_0.99.1           restfulr_0.0.15         pillar_1.10.0          
## [109] vctrs_0.6.5             dbplyr_2.5.0            evaluate_1.0.1         
## [112] readr_2.1.5             GenomicFeatures_1.54.4  mvtnorm_1.3-2          
## [115] cli_3.6.2               locfit_1.5-9.10         compiler_4.3.2         
## [118] rlang_1.1.3             crayon_1.5.3            labeling_0.4.3         
## [121] interp_1.1-6            aroma.light_3.32.0      emdbook_1.3.13         
## [124] affy_1.80.0             plyr_1.8.9              fs_1.6.5               
## [127] stringi_1.8.4           viridisLite_0.4.2       deldir_2.0-4           
## [130] babelgene_22.9          munsell_0.5.1           lazyeval_0.2.2         
## [133] GOSemSim_2.28.1         Matrix_1.6-1.1          patchwork_1.3.0        
## [136] hms_1.1.3               bit64_4.5.2             KEGGREST_1.42.0        
## [139] statmod_1.5.0           igraph_2.1.2            memoise_2.0.1          
## [142] affyio_1.72.0           bslib_0.8.0             ggtree_3.10.1          
## [145] fastmatch_1.1-6         bit_4.5.0.1             gson_0.1.0             
## [148] ape_5.8-1